The data presented is derived from reported crimes classified according to Maryland criminal code and documented by approved police incident reports. The data about crimes do not put info about the victins and its masks the actual address not putting the exact place where the complaint occured.
Source: https://data.world/jboutros/montgomery-county-crime
Maryland County Area
Fist of all we need to import libraries needed. NumPy, Pandas, Matplotlib and Bokeh are needed to run the scripts of this notebook. Below we can see how to import these libraries and how to configure Bokeh to show chart inline (calling output_notebook() function)
In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
#Importing Bokeh library, updated to version 0.12.9
from bokeh.io import push_notebook, show, output_notebook, output_file
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.sampledata.commits import data
from bokeh.models import (
GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool, Jitter
)
from bokeh.core.properties import field
#from bokeh.transform import jitter
#from bokeh.models.transforms import Jitter
#Inline bokeh charts
output_notebook()
In [89]:
#Loading dataset from Montgomery County complaint dataset
monty_data = pd.read_csv("MontgomeryCountyCrime2013.csv")
monty_data.head()
Out[89]:
As we described we have all theses coluns in this dataset.
In [90]:
monty_data.columns
Out[90]:
In [91]:
number_of_registries = monty_data.shape
print(number_of_registries[0])
In [92]:
#Using the agg function allows you to calculate the frequency for each group using the standard library function len.
#Sorting the result by the aggregated column code_count values, in descending order, then head selecting the top n records, then reseting the frame; will produce the top n frequent records
top = monty_data.groupby(['Class','Class Description'])['Class'].agg({"frequency": len}).sort_values("frequency", ascending=False).head(43).reset_index()
top['frequency'] = (top['frequency']/number_of_registries[0])*100
top
Out[92]:
As we can see the most common type of complaint is "Driving under influence". Unfortunatelly we do not have the type of subtance influencing the driver like alcohool and other drugs. Altought, interestingly the second most common complaint is "Possession of CDS (Controlled Dangerous Substance) marijuana/hashish". But we also can not correlate the substance of with the drives is influenced with the possession of marijuana/hashish.
The number of 43 Classes of crimes was not a magic number. These 43 classes of crimes make 80% of the total of crimes and the 20% are distributed in the other 242 classes.
The main purpose of this kind of approach is to be right to the point, where the overwhelming majority of the complaints are related. The Police Office or County's board of advisors could make decisions to prevent or mitigate these kinds of problems based on the ammount of crimes. This kind of approach will be time saving :] in the methods that cannot be fully automated, like classification of crimes in violent or not or the description of 'master classes'.
In [70]:
from decimal import *
#Configure precision
getcontext().prec = 2
parcial_perc = top['frequency'].sum()
parcial_perc = round(parcial_perc,2)
tot_classes = monty_data['Class'].value_counts(normalize=True, sort=True).shape[0]
print("The crimes above are responsible for up to " + str(parcial_perc) + "% (Pareto Analysis) of the total crimes. Performed by "+ str(top.shape[0])+" out of "+ str(tot_classes)+" classes of crimes!")
print("For precision purposes we had only " + str(round((43/285)*100,2)) + "% classes that represent 80% of the total of complaints")
For reference: https://en.wikipedia.org/wiki/Pareto_analysis
In [93]:
#Creating a master class to categorize crimes
classaux = monty_data["Class"]/100
classaux = classaux.astype(int)
classaux = classaux*100
#Inserting this new data in the dataset
monty_data["MasterClass"] = classaux
monty_data[["Class","Class Description",'MasterClass']].head(5)
Out[93]:
We can observe that line 17 and 19 below are respectively from class 1834 and 1833. Looking at its 'Class Description' both are classified as Controlle Dangerous Substance Possession and could be agregated in a major class 1800.
In [95]:
monty_data[17:25].head(3)
Out[95]:
Bellow we ranked by frequency the lista of crimes and it's 'master class'
In [117]:
#Considering the top crimes
#copy
top_classes_top = top
#Creation of a Master Class
top_classes_top['Master Class'] = 0
aux = top_classes_top['Master Class'].astype(float,copy=True)
top_classes_top['Master Class'] = aux
top_classes_top['Master Class'] = top_classes_top['Class']/100
top_classes_top['Master Class'] = top_classes_top['Master Class'].round()
top_classes_top['Master Class'] = top_classes_top['Master Class']*100
aux = top_classes_top['Master Class'].astype(int,copy=True)
top_classes_top['Master Class'] = aux
top_classes_top
Out[117]:
In [126]:
top_classes_top[['Class Description','frequency']].plot.bar()
Out[126]:
Analysing the descriptions of the crimes is common to see that the 'class description' are separated with a hyphen sign but not all master classes of crimes could be generalized to adopt the left portion of '-' description. As we can notice below the master class 2900 was classified as 'Misc.' because there is more than one type of crime related and we did not found a better derscription. Its important to notice that we only worked with the top complaints (Pareto). And because of the 20/80 analysis we only have to threat 14 'master classes'.
In [75]:
#Inserting the description of the Master Classes
top_classes_top['Master Class Description'] =''
top_classes_top[top_classes_top['Master Class'] == 600]
test_top = top_classes_top
test_top.loc[(test_top['Master Class'] == 600),'Master Class Description'] = 'LARCENY'
test_top.loc[(test_top['Master Class'] == 2900),'Master Class Description'] = 'MISC'
test_top.loc[(test_top['Master Class'] == 1400),'Master Class Description'] = 'VANDALISM'
test_top.loc[(test_top['Master Class'] == 1000),'Master Class Description'] = 'FORGERY/CNTRFT'
test_top.loc[(test_top['Master Class'] == 500),'Master Class Description'] = 'BURGLARY'
test_top.loc[(test_top['Master Class'] == 800),'Master Class Description'] = 'ASSAULT & BATTERY'
test_top.loc[(test_top['Master Class'] == 1800),'Master Class Description'] = 'CONTROLLED DANGEROUS SUBSTANCE POSSESSION'
test_top.loc[(test_top['Master Class'] == 700),'Master Class Description'] = 'THEFT'
test_top.loc[(test_top['Master Class'] == 2100),'Master Class Description'] = 'JUVENILE RUNAWAY'
test_top.loc[(test_top['Master Class'] == 2800),'Master Class Description'] = 'DRIVING UNDER THE INFLUENCE'
test_top.loc[(test_top['Master Class'] == 1900),'Master Class Description'] = 'CONTROLLED DANGEROUS SUBSTANCE IMPLMNT'
test_top.loc[(test_top['Master Class'] == 2200),'Master Class Description'] = 'LIQUOR - DRINK IN PUB OVER 21'
test_top.loc[(test_top['Master Class'] == 2400),'Master Class Description'] = 'DISORDERLY CONDUCT'
test_top.loc[(test_top['Master Class'] == 2700),'Master Class Description'] = 'TRESPASSING'
test_top
Out[75]:
It's notorious that "Driving under the influency" is the most common crime commited but when we agregate classes we could take another understading of the most common crimes. As we can see below, the master class 600, the code to 'Larceny', is the most common type of crime with 25.44% of the TOP43 complaints.
In [78]:
test_top['Master Class'].value_counts(sort=True)
Out[78]:
In [107]:
test_top.groupby(['Master Class']).sum()
Out[107]:
According to wikipedia (https://en.wikipedia.org/wiki/Violent_crime)violent crimes include, but are not limited to, this list of crimes: Typically, violent criminals includes aircraft hijackers, bank robbers, muggers, burglars, terrorists, carjackers, rapists, kidnappers, torturers, active shooters, murderers, gangsters, drug cartels, and others.
Starting from the crimes in the pareto analysis we noticed that only tree master classes could be considered violent, that are:
In [110]:
test_top['Violent crime'] = False
test_top.loc[(test_top['Master Class'] == 500),'Violent crime'] = True
test_top.loc[(test_top['Master Class'] == 800),'Violent crime'] = True
test_top.loc[(test_top['Master Class'] == 700),'Violent crime'] = True
test_top.sort_values(['Violent crime', 'frequency'], ascending=False, axis=0, kind='quicksort')
Out[110]:
From the classification made above over the most common crimes (80%) we could make a statement that only 8.53% of these kind of crimes are violent.
In [111]:
value_percentage = test_top[test_top['Violent crime'] == True]['frequency'].sum()
value_percentage = round(value_percentage,2)
print(str(value_percentage) + '% of the top43 crimes are violent')
In [130]:
valores = [100-value_percentage, value_percentage]
marcacoes = 'Não violentos', 'Violentos'
plt.pie(valores, labels=marcacoes, autopct='%1.2f%%', shadow=True)
Out[130]:
Processing data from 'Dispatch Date/Time'
In [13]:
datetime = pd.to_datetime(monty_data['Dispatch Date / Time'])#Takes too long to process
In [112]:
#Considering the top crimes
date_data = monty_data[['Dispatch Date / Time', 'Class']]
#Creation of a Master Class
date_data['Master Class'] = 0
aux = date_data['Master Class'].astype(float,copy=True)
date_data['Master Class'] = aux
date_data['Master Class'] = date_data['Class']/100
date_data['Master Class'] = date_data['Master Class'].round()
date_data['Master Class'] = date_data['Master Class']*100
aux = date_data['Master Class'].astype(int,copy=True)
date_data['Master Class'] = aux
#date_data.head(5)
date_data['Period of the day']=''
#print("-------------")
#TODO make a better version of this cell, the cast not worked, work around
For select period of times we simply filtered the column 'Dispatch Date / Time', transformed in a proper type (datetime) with
pd.to_datetime(monty_data['Dispatch Date / Time'])
extracted only the hour out of the hole datetime structure and filtered according the period of the day. So, Morning period received all complaint starting from 5AM to 12AM. Afternoon from 1AM to 6PM and Night from 7PM to 4AM as we can se after:
In [131]:
morning = datetime[(datetime.dt.hour > 5) & (datetime.dt.hour <= 12)]
afternoon = datetime[(datetime.dt.hour > 12) & (datetime.dt.hour <= 18)]
night = datetime[((datetime.dt.hour > 18) & (datetime.dt.hour <= 23)) | (datetime.dt.hour > 0) & (datetime.dt.hour <= 4)]
print("In the morning we computed "+ str(morning.shape[0]) + " crimes. There is a probability of " + str(round((morning.shape[0]/23369)*100,2)) + "% of a complaint be registered in the morning")
print("At afternoon we computed "+ str(afternoon.shape[0]) + " crimes. There is a probability of " + str(round((afternoon.shape[0]/23369)*100,2)) + "% of a complaint be registered at afternoon")
print("In the night we computed "+ str(night.shape[0]) + " crimes. There is a probability of " + str(round((night.shape[0]/23369)*100,2)) + "% of a complaint be registered in the morning")
Based on the information of the column 'Dispatch Date/Time' filtered the dates according to it's weekdays. And bellow we have that TUESDAY has the majority of occurences. That information is followed by a bar chart showing the frequency of each day.
In [135]:
day_of_the_week = datetime.dt.weekday_name
result = day_of_the_week.value_counts()
print('Tuesday is the day more probable to happen a crime')
result.to_frame
Out[135]:
In [115]:
fig, ax = plt.subplots()
ind = np.arange(0, 7)
Tue = plt.bar(0, height=(result[0]/23369)*100, color='red')
Mon = plt.bar(1, height=(result[1]/23369)*100, color='green')
Wed = plt.bar(2, height=(result[2]/23369)*100, color='blue')
Fri = plt.bar(3, height=(result[3]/23369)*100, color='black')
Thu = plt.bar(4, height=(result[4]/23369)*100, color='brown')
Sat = plt.bar(5, height=(result[5]/23369)*100, color='yellow')
Sun = plt.bar(6, height=(result[6]/23369)*100, color='orange')
ax.set_xticks(ind)
ax.set_xticklabels(['Tuesday','Monday','Wednesday','Friday','Thursday','Saturday','Sunday'])
ax.set_ylim([0, 20])
plt.xticks(ind,rotation='45')
ax.set_ylabel('Percentage')
ax.set_title('Number of crimes per day of the week')
plt.show()
Looking for a way to better see how the 'Dispatch Date/Time' is distributed according the period of the day and day of the week we found a great example of a scatter chart (https://bokeh.pydata.org/en/latest/docs/user_guide/categorical.html#adding-jitter). Unfortunatelly there is a problem importing jitter to this notebook!
In [147]:
datetime_analys = pd.to_datetime(monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==600) | (monty_data['MasterClass']==800)]['Dispatch Date / Time'])#Takes too long to process
In [148]:
#output_file("categorical_scatter_jitter.html")
DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']
dow = {0 : 'Sun', 1:'Mon', 2:'Tue', 3:'Wed', 4:'Thu', 5:'Fri', 6:'Sat'}
#copy of original datetime
scatter_datetime = datetime_analys
scatter_datetime = scatter_datetime.rename('complete_date')
scatter_dayofweek = datetime_analys.dt.dayofweek
scatter_dayofweek = scatter_dayofweek.rename('day_of_week')
scatter_dayofweek=scatter_dayofweek.replace(dow)
scatter_hour = datetime_analys.dt.time
scatter_hour = scatter_hour.rename('hour')
dictionary = {'datetime': scatter_datetime, 'day': scatter_dayofweek, 'time':scatter_hour}
scatter_complete = pd.DataFrame(data=dictionary)
source = ColumnDataSource(scatter_complete)
p = figure(plot_width=950, plot_height=500, y_range=DAYS, x_axis_type='datetime',
title="Violent complaints by Time of Day (US/Central) - Montgomey County - 2013")
#p.circle(x='time', y=Jitter('day', width=0.6, range=p.y_range), source=source, alpha=0.3)
p.circle(x='time', y='day', source=source, alpha=0.1)
#p.circle(x='time', y=Jitter('day', width=0.6, mean=0, distribution='uniform', range=p.y_range,), source=source, alpha=0.3)
#p.circle(x='time', y=Jitter(name='day', width=0.6, mean=0, distribution='uniform'), source=source, alpha=0.3)
#p.circle(x='time', y='day', source=source, alpha=0.3, size=1)
#field_name, width, mean=0, distribution='uniform', range=None
#p.circle(x='time', y=field('day', Jitter(mean=0, width=0.6, distribution='uniform', range=None)), source=source, alpha=0.3)
p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None
In [149]:
show(p,notebook_handle=True)
Out[149]:
As we can see above there is a concentration starting from aroud 7:00AM in the majority of the days. On Friday and Saturday the are fewer complaints registered.
Making the same kind of plot with the data of the 'Start Time' the is visible the difference between the two plots. I this second approach there is a shift in the time of the day as we can see below.
In [150]:
datetime_analys_start = pd.to_datetime(monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==600) | (monty_data['MasterClass']==800)]['Start Date / Time'])#Takes too long to process
In [151]:
scatter_datetime_start = datetime_analys_start
scatter_datetime_start = scatter_datetime_start.rename('complete_date')
scatter_dayofweek_start = datetime_analys_start.dt.dayofweek
scatter_dayofweek_start = scatter_dayofweek_start.rename('day_of_week')
scatter_dayofweek_start = scatter_dayofweek_start.replace(dow)
scatter_hour_start = datetime_analys_start.dt.time
scatter_hour_start = scatter_hour_start.rename('hour')
dictionary_start = {'datetime': scatter_datetime_start, 'day': scatter_dayofweek_start, 'time':scatter_hour_start}
scatter_complete_start = pd.DataFrame(data=dictionary_start)
source_start = ColumnDataSource(scatter_complete_start)
p = figure(plot_width=950, plot_height=500, y_range=DAYS, x_axis_type='datetime',
title="Violent complaints by Time of Day (US/Central) - Start Time - Montgomey County - 2013")
p.circle(x='time', y='day', source=source_start, alpha=0.1)
p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None
In [152]:
show(p,notebook_handle=True)
Out[152]:
Now is time to see how are the behaviour according to the month of the Dispatch Time. First of all, we can detect there is a lack of information from the first semester of the year, chart below. So, considering only the second half of the year the month with more complaints is OCTOBER with almost 17.5%
In [153]:
month_of_the_year = datetime.dt.month
result_month = month_of_the_year.value_counts()
print('October is the month more likely to happen a crime with a probability of '+ str((result_month[10]/23369)*100)+'%')
number_of_crimes = sum(result_month)
print('Total number of complaints: ' + str(number_of_crimes))
result_month.to_frame
fig1, ax1 = plt.subplots()
ind = np.arange(0, 13)
Oct = plt.bar(10,height=(result_month[10]/23369)*100,color='indigo')
Aug = plt.bar(8,height=(result_month[8]/23369)*100,color='cyan')
Nov = plt.bar(11,height=(result_month[11]/23369)*100,color='blue')
Sep = plt.bar(9,height=(result_month[9]/23369)*100,color='lightblue')
Dec = plt.bar(12,height=(result_month[12]/23369)*100,color='purple')
Jul = plt.bar(7,height=(result_month[7]/23369)*100,color='magenta')
ax.set_xticks(ind)
ax.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'])
ax.set_ylim([0, 20])
plt.xticks(ind,rotation='45')
#plt.xlabel.
ax.set_ylabel('Percentage')
ax.set_title('Number of crimes per month of the year')
plt.show()
The dataset contains the latitude and longitude of the complaints. This data could be utilized to see in a map how the crimes are distributed.
Configuring maps and loading data about where the complaints have occured. Observe, to sucesfully configure the Google Maps you have to create an API Key (You can generate one from this site: https://developers.google.com/maps/documentation/javascript/get-api-key) and change in the line 'plot.api_key = ""'
In [154]:
map_options = GMapOptions(lat=39.151040, lng=-77.193020, map_type="roadmap", zoom=11)
plot = GMapPlot(x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options)
plot.title.text = "Montgomery County"
# For GMaps to function, Google requires you obtain and enable an API key:
#
# https://developers.google.com/maps/documentation/javascript/get-api-key
#
# Replace the value below with your personal API key:
plot.api_key = "AIzaSyBFHmpkUOfk2FtDZXHVBSUUHp6LVPmI-fs"
Load data in using read_csv function and filtering Latitude and Longitude data. After this showing a preview of the data loaded.
In [29]:
#Reference https://www.census.gov/data/tables/2016/demo/popest/total-cities-and-towns.html
#https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?fpt=table
pop_data = pd.read_csv("PEP_2016_PEPANNRES.csv",sep=';')
#Adjusting data
#Removing city configuration
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("city, Maryland","")
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("town, Maryland","")
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("village, Maryland","")
pop_data=pop_data.rename_axis({'GEO.display-label': "City"}, axis="columns")
pop_data['City'] = pop_data['City'].str.upper()
pop_data['City'] = pop_data['City'].str.strip() #Merge not working 'cause of a ' '
#Now the data was merged with data from population
pop_data_montydata = monty_data.merge(pop_data, left_on=['City'], right_on=['City'], how='inner')
pop_data[['City','respop72013']].sort_values(by=['respop72013'],ascending=False)
Out[29]:
In [30]:
monty_data.columns
#monty_data[monty_data['City']].value_counts
cities_data = pd.DataFrame({'freq':monty_data['City'].value_counts(normalize=True,sort=True,ascending=False)*100,'count':monty_data['City'].value_counts(sort=True,ascending=False)})
cities_data['per_hundred_thousand'] = cities_data['count']/100000
#cities_data.merge(pop_data, left_on=['object'], right_on=['City'], how='inner')
cities_data
Out[30]:
Ref:https://en.wikipedia.org/wiki/Silver_Spring,_Maryland
Ref:https://en.wikipedia.org/wiki/Unincorporated_community
Bellow there is a distribuition of the amount of crimes per municipalities
In [165]:
fig, axes = plt.subplots(nrows=2, ncols=1,figsize=(40,40))
cities_data['count'].plot(kind='bar',ax=axes[0])
axes[0].set_title('Count of crimes')
Out[165]:
In [32]:
monty_data['Class'].unique
freq_data = monty_data['Class'].value_counts(normalize=True, sort=True, ascending=False)
freq_data.to_frame
chart_data = monty_data['Class'].value_counts(normalize=True,sort=True,ascending=False)#.to_dict(orient='series')
print(chart_data.head(10))
Configuring Bokeh to utilize Google Maps.
In [166]:
map_options2 = GMapOptions(lat=39.151042, lng=-77.193023, map_type="roadmap", zoom=11)
plot2 = GMapPlot(x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options2)
plot2.title.text = "Montgomery County"
# For GMaps to function, Google requires you obtain and enable an API key:
#
# https://developers.google.com/maps/documentation/javascript/get-api-key
#
# Replace the value below with your personal API key:
plot2.api_key = "AIzaSyBFHmpkUOfk2FtDZXHVBSUUHp6LVPmI-fs"
In [169]:
violent_data = monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==700) | (monty_data['MasterClass']==800)]
non_violent_data = monty_data[~((monty_data['MasterClass']==500) | (monty_data['MasterClass']==700) | (monty_data['MasterClass']==800))]
source_violent = ColumnDataSource(
data=dict(
lat=violent_data["Latitude"],
lon=violent_data["Longitude"],
)
)
source_non_violent = ColumnDataSource(
data=dict(
lat=non_violent_data["Latitude"],
lon=non_violent_data["Longitude"],
)
)
###print(source.data.values)
circle_red = Circle(x="lon", y="lat", size=3, fill_color="red", fill_alpha=0.8, line_color=None)
circle_blue = Circle(x="lon", y="lat", size=3, fill_color="green", fill_alpha=0.8, line_color=None)
plot2.add_glyph(source_violent, circle_red)
plot2.add_glyph(source_non_violent, circle_blue)
#
plot2.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
In [170]:
show(plot2,notebook_handle=True)
Out[170]:
In [155]:
source = ColumnDataSource(
data=dict(
lat=latitude_data[13:130],
lon=longitude_data[13:130],
)
)
print(source.data.values)
circle = Circle(x="lon", y="lat", size=1, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)
plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
Ploting the geographic data in Google Maps. Note that the 'show' function receives another parameter 'notebook_handle=True' responsible for tell Bhoke to do a inline plot
In [58]:
show(plot,notebook_handle=True)
Out[58]:
In [43]:
#unemployment = pd.read_csv("Unemployment_MontgomeryCounty_2005to2017.csv",sep=';')
#unemployment